Prediction Models: Lecture

BDSI 2025; University of Michigan

Phil Boonstra

Q: Which puzzle-building strategy is best, and how good is it?

Start with edge pieces

Sort and build high contrast/color regions

Sort into knob-and-hole combinations

How to measure puzzle strategy success/failure?

  • time to completion
  • average # tries to fitting piece
  • proportion of first tries that fit

Defining terms

model selection is “estimating the performance of different models in order to choose the best one”

  • identify the best puzzle building strategy from among the candidates on a relative basis

model assessment is, “having chosen a final model, estimating its prediction error…on new data”

  • measuring how good your chosen puzzle building strategy is

Hastie, Tibshirani, and Friedman (2009)

Two main considerations

  1. Using data honestly
  2. Measuring error

Using data honestly

Notation

Each observation’s outcome \(Y\) (continuous or categorical) is to be predicted with function of auxiliary knowledge, i.e. covariates, from that observation \(X=x\), denoted as \(\hat Y(x)\)

Want \(\hat Y(x)\) to be close to \(Y\), but need to define what is meant by “close”

Use ‘mean squared prediction error’ (MSPE) for now: \((Y - \hat Y(x))^2\)

Numeric example (binary outcome)

  • Generate \(Y\) from logistic regression model: \(\Pr(Y = 1|X=x) = 1/(1+\exp(-\alpha+ x\beta))\)
  • One observation consists of \(\{Y, x\}\)
  • \(n = 100\) observations each in training and validation
  • \(p = 20\) covariates, distributed as independent normal random variables
  • \(p/2 = 10\) of coefficients equal to 0.25; remaining equal to 0
  • Expected prevalence of about 0.6
  • \(\hat Y(x)=\hat{\Pr}(Y=1|X=x) = 1/(1+\exp(-\hat\alpha+ x\hat\beta))\)

set.seed(7201969);
#n = 400 observations; 1:1 for training:validation
n <- 400;
training_subset <- 1:round(n / 2);
validation_subset <- (round(n / 2) + 1):n;
#p covariates
p <- 20;
#baseline prevalence approximately 0.6
alpha <- log(0.6/0.4); #0.405
#normally distributed covariates, correlation 0.1
x <- matrix(rnorm(n*p), 
            nrow = n);
x <- x %*% chol(0.1 + diag(0.9, p))
colnames(x) <- glue("x{1:p}")
#half of covariates have odds ratios exp(0.25) 
beta <- c(rep(0.25, floor(p/2)), numeric(ceiling(p/2)));
true_probs <- 1/(1+exp(-alpha - drop(x%*%beta)));
y <- rbinom(n, 1, true_probs);
all_data <- bind_cols(y = y, data.frame(x));
full_fmla <- 
  glue("y~{glue_collapse(glue('x{1:p}'),sep='+')}") %>%
  as.formula();

Eight strategies considered:

  • True model (‘(truth)’; for benchmarking)
  • Intercept only (‘null’; not considered for selection)
  • All covariates (‘full’)
  • Forward selection (‘forward’): start with the null model, incrementally add in covariates that seem to improve fit
  • Forward selection with max of 1 variable / 25 observations in training data (‘forward_pruned’)
  • Backward selection (‘backward’): start will the full model, incrementally subtract covariates that seem to harm fit
  • Logistic regression with firth penalty (‘firth’): penalized logistic regression
  • Logistic regression with lasso penalty
full_model <- glm(full_fmla,
                  data = all_data,
                  subset = training_subset, 
                  family = "binomial");
null_model <- glm(y ~ 1,
                  data = all_data,
                  subset = training_subset, 
                  family = "binomial");

#forward selection (using MASS package)
forward_model <- 
  stepAIC(null_model,
          scope = list(upper = full_fmla),
          direction = "forward",
          trace = F);

#pruned forward selection: max of 1 coefficient per 25 training observations 
forward_pruned_model <- 
  stepAIC(null_model,
          scope = list(upper = full_fmla),
          direction = "forward",
          steps = max(1, floor(length(training_subset) / 25)),
          trace = F);

#backward selection
backward_model <- 
  stepAIC(full_model,
          direction = "backward",
          trace = F);

# Firth logistic regression
firth_model <-
  logistf(full_fmla, 
          data = slice(all_data, training_subset))

# Lasso regression
lasso_model <- 
  cv.glmnet(full_fmla, 
            data = all_data,
            subset = training_subset, 
            family = "binomial", alpha = 1)

Selection

tidy(full_model) %>% 
  left_join(tibble(term = glue("x{1:floor(p/2)}"), 
                   nonzero = 1)) %>%
  mutate(nonzero = replace_na(nonzero, 0)) %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  print(n = Inf);
# A tibble: 20 × 6
   term   estimate std.error statistic p.value nonzero
   <glue>    <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
 1 x6      0.427       0.178    2.40    0.0163       1
 2 x3      0.391       0.176    2.22    0.0262       1
 3 x4      0.342       0.189    1.81    0.0698       1
 4 x5      0.324       0.178    1.81    0.0698       1
 5 x7      0.288       0.167    1.72    0.0847       1
 6 x11    -0.327       0.207   -1.58    0.114        0
 7 x9      0.281       0.192    1.46    0.143        1
 8 x1      0.192       0.154    1.25    0.212        1
 9 x14    -0.170       0.166   -1.02    0.307        0
10 x13    -0.160       0.167   -0.962   0.336        0
11 x8      0.168       0.179    0.941   0.346        1
12 x18    -0.154       0.176   -0.874   0.382        0
13 x20    -0.133       0.180   -0.739   0.460        0
14 x19     0.0909      0.165    0.550   0.582        0
15 x10     0.0951      0.174    0.547   0.584        1
16 x17     0.106       0.196    0.542   0.588        0
17 x16     0.0938      0.173    0.542   0.588        0
18 x2      0.0286      0.170    0.168   0.866        1
19 x12    -0.0104      0.200   -0.0521  0.958        0
20 x15    -0.00586     0.175   -0.0334  0.973        0

tidy(forward_model) %>% 
  left_join(tibble(term = glue("x{1:floor(p/2)}"), 
                   nonzero = 1)) %>%
  mutate(nonzero = replace_na(nonzero, 0)) %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  print(n = Inf);
# A tibble: 6 × 6
  term   estimate std.error statistic p.value nonzero
  <glue>    <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
1 x6        0.432     0.162      2.66 0.00773       1
2 x3        0.384     0.168      2.29 0.0218        1
3 x4        0.359     0.170      2.11 0.0352        1
4 x5        0.344     0.168      2.04 0.0409        1
5 x11      -0.324     0.192     -1.69 0.0908        0
6 x7        0.227     0.152      1.49 0.136         1

tidy(forward_pruned_model) %>% 
  left_join(tibble(term = glue("x{1:floor(p/2)}"), 
                   nonzero = 1)) %>%
  mutate(nonzero = replace_na(nonzero, 0)) %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  print(n = Inf);
# A tibble: 6 × 6
  term   estimate std.error statistic p.value nonzero
  <glue>    <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
1 x6        0.432     0.162      2.66 0.00773       1
2 x3        0.384     0.168      2.29 0.0218        1
3 x4        0.359     0.170      2.11 0.0352        1
4 x5        0.344     0.168      2.04 0.0409        1
5 x11      -0.324     0.192     -1.69 0.0908        0
6 x7        0.227     0.152      1.49 0.136         1

tidy(backward_model) %>%
  left_join(tibble(term = glue("x{1:floor(p/2)}"), 
                   nonzero = 1)) %>%
  mutate(nonzero = replace_na(nonzero, 0)) %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  print(n = Inf);
# A tibble: 6 × 6
  term   estimate std.error statistic p.value nonzero
  <glue>    <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
1 x6        0.432     0.162      2.66 0.00773       1
2 x3        0.384     0.168      2.29 0.0218        1
3 x4        0.359     0.170      2.11 0.0352        1
4 x5        0.344     0.168      2.04 0.0409        1
5 x11      -0.324     0.192     -1.69 0.0908        0
6 x7        0.227     0.152      1.49 0.136         1

tibble(term = firth_model$terms, estimate = firth_model$coefficients, p.value = firth_model$prob) %>%
  left_join(tibble(term = glue("x{1:floor(p/2)}"), 
                   nonzero = 1)) %>%
  mutate(nonzero = replace_na(nonzero, 0)) %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  print(n = Inf);
# A tibble: 20 × 4
   term   estimate p.value nonzero
   <glue>    <dbl>   <dbl>   <dbl>
 1 x6      0.377    0.0211       1
 2 x3      0.344    0.0337       1
 3 x5      0.287    0.0828       1
 4 x4      0.302    0.0840       1
 5 x7      0.250    0.105        1
 6 x11    -0.287    0.133        0
 7 x9      0.246    0.168        1
 8 x1      0.171    0.234        1
 9 x14    -0.150    0.336        0
10 x13    -0.144    0.357        0
11 x8      0.147    0.380        1
12 x18    -0.138    0.405        0
13 x20    -0.119    0.482        0
14 x17     0.0970   0.598        0
15 x19     0.0819   0.599        0
16 x10     0.0825   0.613        1
17 x16     0.0820   0.614        0
18 x2      0.0271   0.866        1
19 x15    -0.00676  0.967        0
20 x12    -0.00648  0.973        0

tidy(lasso_model$glmnet.fit, return_zeros = TRUE) %>% filter(lambda == lasso_model$lambda.1se) %>%
  left_join(tibble(term = glue("x{1:floor(p/2)}"), 
                   nonzero = 1)) %>%
  mutate(nonzero = replace_na(nonzero, 0)) %>%
  filter(term != "(Intercept)") %>%
  arrange(-abs(estimate)) %>%
  print(n = Inf);
# A tibble: 20 × 6
   term    step estimate lambda dev.ratio nonzero
   <glue> <dbl>    <dbl>  <dbl>     <dbl>   <dbl>
 1 x3         5   0.120  0.0780    0.0450       1
 2 x4         5   0.104  0.0780    0.0450       1
 3 x6         5   0.0582 0.0780    0.0450       1
 4 x5         5   0.0442 0.0780    0.0450       1
 5 x1         5   0      0.0780    0.0450       1
 6 x2         5   0      0.0780    0.0450       1
 7 x7         5   0      0.0780    0.0450       1
 8 x8         5   0      0.0780    0.0450       1
 9 x9         5   0      0.0780    0.0450       1
10 x10        5   0      0.0780    0.0450       1
11 x11        5   0      0.0780    0.0450       0
12 x12        5   0      0.0780    0.0450       0
13 x13        5   0      0.0780    0.0450       0
14 x14        5   0      0.0780    0.0450       0
15 x15        5   0      0.0780    0.0450       0
16 x16        5   0      0.0780    0.0450       0
17 x17        5   0      0.0780    0.0450       0
18 x18        5   0      0.0780    0.0450       0
19 x19        5   0      0.0780    0.0450       0
20 x20        5   0      0.0780    0.0450       0

predict_models <- 
  list(null = null_model, 
       full = full_model, 
       forward = forward_model, 
       forward_pruned = forward_pruned_model, 
       backward = backward_model, 
       firth = firth_model, 
       lasso = lasso_model) %>%
  map(predict, newdata = all_data, type = 'resp') %>%
  map(as.numeric) %>%
  bind_cols() %>%
  mutate(true_probs = true_probs,
         y = y, 
         training = row_number() %in% training_subset) %>%
  pivot_longer(c(true_probs, null:lasso),
               names_to = "model_name") %>%
  mutate(model_name = factor(model_name) %>% fct_inorder())

Training MSPEs

\(\dfrac{1}{200}\sum_{i=1}^{200} (Y_i - \hat Y(x_i))^2\)

predict_models %>% 
  filter(training) %>%
  group_by(model_name) %>% 
  summarize(mspe = mean((y - value)^2))
# A tibble: 8 × 2
  model_name      mspe
  <fct>          <dbl>
1 true_probs     0.215
2 null           0.245
3 full           0.200
4 forward        0.209
5 forward_pruned 0.209
6 backward       0.209
7 firth          0.200
8 lasso          0.230

Full model has smallest MSPE in training subset

Validation MSPEs

\(\dfrac{1}{200}\sum_{i=201}^{400} (Y_i - \hat Y(x_i))^2\)

predict_models %>% 
  filter(!training) %>%
  group_by(model_name) %>% 
  summarize(mspe = mean((y - value)^2))
# A tibble: 8 × 2
  model_name      mspe
  <fct>          <dbl>
1 true_probs     0.205
2 null           0.244
3 full           0.226
4 forward        0.229
5 forward_pruned 0.229
6 backward       0.229
7 firth          0.225
8 lasso          0.234

Except for true and null models, all MSPEs increase

Firth model has smallest MSPE in validation subset

  • An aside: even though it has the smallest MSPE, is it still good in an absolute sense? What MSPE do we get from using \(\hat Y(x_i)\equiv 0.5\)?

Assuming we report firth model as the model, is this the MSPE we should expect in the future?

Simulation study

  • \(n = 200;200;200\) observations in training;validation;testing
  • Same generating model as previously but repeat 500 times
  • In each simulated dataset, only three models are taken to the testing step: the true and null models (for benchmarking) and whichever other model has best validation MSPE
observed_results <-
  all_results %>%
  group_by(sim) %>%
  #keep from the test step only the method we would have selected
  filter(model_name %in% c("(truth)","null") | 
           step != "testing" | 
           mspe_ranking == min(mspe_ranking)) %>%
  ungroup();

model_colors = c("black",brewer.pal(7, "Dark2"));
ggplot(observed_results) + 
  geom_boxplot(aes(x = step, 
                   y = mspe, 
                   color = model_name), 
               fill = "#AAAAAAAA",
               outlier.shape = NA,
               varwidth = FALSE) + 
  geom_hline(yintercept = 0.25) + 
  scale_color_manual(values = model_colors) + 
  labs(x = "", y = "MSPE",  color = "Strategy") + 
  theme(text = element_text(size = 22), 
        legend.position = "top");

change_in_optimism <- 
  observed_results %>%
  dplyr::select(-row_number, -absolute, -zero_one, -deviance) %>%
  pivot_wider(names_from = step, values_from = mspe) %>%
  mutate(validation_minus_training = validation - training, 
         testing_minus_validation = testing - validation) %>%
  dplyr::select(-training, -validation, -testing) %>% 
  pivot_longer(cols = contains("minus"), names_to = "delta", values_to = "value") %>%
  group_by(sim) %>%
  filter(model_name %in% c("(truth)","null") | 
          delta == "validation_minus_training" | 
          mspe_ranking == min(mspe_ranking)) %>%
  ungroup() %>%
  mutate(delta = factor(delta,
                        levels = c("validation_minus_training",
                                   "testing_minus_validation"), 
                        labels = c("training to validation", 
                                   "validation to testing")));
ggplot(change_in_optimism) + 
  geom_boxplot(aes(x = delta, 
                   y = value,
                   color = model_name), 
               fill = "#AAAAAAAA",
               outlier.shape = NA,
               varwidth = FALSE) + 
  scale_color_manual(values = model_colors) + 
  labs(x = "",
       y = "Optimism (change in MSPE)", 
       color = "Strategy") + 
  theme(text = element_text(size = 22), 
        legend.position = "top");

Model selection, conducted properly, adjusts for optimism in training

Model assessment, conducted properly, adjusts for regression to the mean; potential for optimism increases with variability of method

Measuring error

When \(Y\) is binary, there are several ways of thinking about ‘error’

  1. overall prediction error
  2. calibration
  3. discrimation

Model assessment can be based on any sensible, quantifiable error function

Steyerberg et al. (2010)

Overall prediction error

“How close are the actual outcomes to the predicted outcomes?”

\(\hat Y(x)=\hat{\Pr}(Y=1|X=x)\)

MSPE or Brier score: \(=(Y - \hat Y(x))^2\)

Absolute: \(=|Y - \hat Y(x)|\)

0-1: \((1-Y)\times 1_{[ \hat Y(x)\geq0.5]} + Y\times 1_{[ \hat Y(x)<0.5]}\)

Deviance: \(-2(1-Y)\log[1- \hat Y(x)] -2Y\log \hat Y(x)\)

Error functions against \(\hat Y(x)\) when \(Y=1\)

prob_seq <- seq(0.01, 1, by = 0.001);#
ggplot() + 
  geom_path(aes(x = prob_seq, y = (1-prob_seq)^2, color = "1Quadratic"), size = 1) + 
  geom_path(aes(x = prob_seq, y = abs(1-prob_seq), color = "2Absolute"), size = 1) +
  geom_path(aes(x = prob_seq, y = 1*(prob_seq < 0.5), color = "30-1"), size = 1) +
  geom_path(aes(x = prob_seq, y = -2 * log(prob_seq), color = "4Deviance"), size = 1) + 
  coord_cartesian(ylim = c(0, 2)) + 
  scale_color_manual(labels = c("Quadratic (MSPE)", "Absolute", "0-1", "Deviance"), 
                     values = c("#E41A1C","#377EB8","#4DAF4A","#984EA3")) + 
  labs(x = expression(hat(Y(x))),
       y = "", 
       color = "Loss") + 
  theme(text = element_text(size = 22), 
        legend.position = "top");

predict_models %>%
  filter(!training) %>%
  group_by(model_name) %>%
  summarize(mspe = mean((y - value)^2),
            abs = mean(abs(y - value)),
            zero_one = mean((1-y) * (value >= 0.5) + y * (value < 0.5)),
            deviance = -2*mean((1-y)*log(1-value) + y*log(value)))
# A tibble: 8 × 5
  model_name      mspe   abs zero_one deviance
  <fct>          <dbl> <dbl>    <dbl>    <dbl>
1 true_probs     0.205 0.398    0.335     1.19
2 null           0.244 0.489    0.42      1.36
3 full           0.226 0.427    0.36      1.30
4 forward        0.229 0.438    0.355     1.30
5 forward_pruned 0.229 0.438    0.355     1.30
6 backward       0.229 0.438    0.355     1.30
7 firth          0.225 0.433    0.36      1.28
8 lasso          0.234 0.476    0.41      1.32

Calibration

“Among observations with predicted prevalence of X%, is the true prevalence close to X%?”

  • overall error functions capture elements of calibration
  • Hosmer-Lemeshow test: group observations based upon \(\hat Y(x)\), compare \(\sum_i \hat Y(x_i)\) to \(\sum_i Y_i\)

Discrimination

“Did the observations in which the outcome occured have a higher predicted risk than the observations in which the outcome did not occur?”

Sensitivity: probability of predicting \(\hat Y(x)=1\) given that, in truth, \(Y(x)=1\)

  • it is not the probability that \(Y(x)=1\) given that we’ve predicted \(\hat Y(x)=1\)

Specificity: probability of predicting \(\hat Y(x)=0\) given that, in truth, \(Y(x)=0\)

  • same caution as above

Receiver operator characteristic (ROC) curve: plot of \(\hat{\mathrm{sens}}(t)\) versus \(1-\hat{\mathrm{spec}}(t)\) for \(t\in[0,1]\), where

  • \(\hat{\mathrm{sens}}(t) = \dfrac{\sum_{i:Y_i=1}1_{[\hat Y(x) > t]}}{\sum_{i:Y_i=1} 1}\) and

  • \(\hat{\mathrm{spec}}(t) = \dfrac{\sum_{i:Y_i=0}1_{[\hat Y(x) \leq t]}}{\sum_{i:Y_i=0} 1}\)

concordance (\(c\)) index : \(\dfrac{\sum_{i,j:Y_i=0,Y_j=1} 1_{[\hat Y(x_i) < \hat Y(x_j)]} + 0.5\times 1_{[\hat Y(x_i) = \hat Y(x_j)]}}{ \sum_{i,j:Y_i=0,Y_j=1} 1}\)

# use pROC package 
true_training_roc <- 
  predict_models %>% 
  filter(training, model_name == "true_probs") %>% 
  roc(response = y, predictor = value)

true_validation_roc <- 
  predict_models %>% 
  filter(!training, model_name == "true_probs") %>% 
  roc(response = y, predictor = value)

firth_training_roc <- 
  predict_models %>% 
  filter(training, model_name == "firth") %>% 
  roc(response = y, predictor = value)

firth_validation_roc <- 
  predict_models %>% 
  filter(!training, model_name == "firth") %>% 
  roc(response = y, predictor = value)

roc_data <-
  bind_rows(tibble(model = "(truth)",
                   step = "training", 
                   sens = true_training_roc$sensitivities,
                   spec = true_training_roc$specificities, 
                   thresh = true_training_roc$thresholds),
            tibble(model = "(truth)",
                   step = "validation", 
                   sens = true_validation_roc$sensitivities,
                   spec = true_validation_roc$specificities, 
                   thresh = true_validation_roc$thresholds),
            tibble(model = "firth",
                   step = "training", 
                   sens = firth_training_roc$sensitivities,
                   spec = firth_training_roc$specificities, 
                   thresh = firth_training_roc$thresholds),
            tibble(model = "firth",
                   step = "validation", 
                   sens = firth_validation_roc$sensitivities,
                   spec = firth_validation_roc$specificities, 
                   thresh = firth_validation_roc$thresholds)) %>%
  mutate(model = factor(model, levels = c("(truth)", "firth"))) %>%
  group_by(model, step) %>% 
  mutate(annotate = ifelse(abs(0.5 - thresh) == min(abs(0.5 - thresh)) & step == "validation", TRUE, FALSE)) %>%
  ungroup() %>%
  arrange(model, step, desc(spec), sens);

roc_plot <- 
  ggplot(filter(roc_data),
         aes(x = 1 - spec,
             y = sens, 
             color = model, 
             linetype = step)) + 
  geom_path() + 
  geom_abline(intercept = 0, slope = 1) + 
  geom_label_repel(data = filter(roc_data, annotate),
                   aes(label = paste0("t = ", formatC(thresh, format = "f", digits = 1))),
                   nudge_x = -0.051,
                   nudge_y = 0.051,
                   size = 4.5) + 
  scale_x_continuous(name = "Spec", 
                     breaks = seq(0, 1, length = 11), 
                     labels = formatC(seq(1, 0, length = 11), format = "g", digits = 1), 
                     expand = expand_scale(add = 0.02)) + 
  scale_y_continuous(name = "Sens", 
                     breaks = seq(0, 1, length = 11), 
                     labels = formatC(seq(0, 1, length = 11), format = "g", digits = 1), 
                     expand = expand_scale(add = 0.02)) +
  scale_linetype_manual(name = "Step", 
                        values = c("dashed", "solid")) + 
  scale_color_manual(name = "Model",
                     values = model_colors[c(1,7)]) + 
  theme(text = element_text(size = 16), 
        legend.position = "top");
roc_plot;

(base ROC plot)

plot(firth_validation_roc)

Interpreting ROC curve

For the firth model:

  • For fixed specificity of 0.8, can plausibly expect sensitivity of about 0.45

  • If we classify based upon cutoff of \(\hat Y(x) > 0.5\) versus \(\hat Y(x) \leq 0.5\),we would expect sensitivity ~0.73 and specificity ~0.52

  • Achieving high sensitivity is difficult without sacrificing specificity

What does least ideal ROC curve look like?

Area under ROC curve (AUC or AUROC)

ROC curves are interesting if range of cutoffs are of interest

However, it has been shown that the area under the ROC curve is, in the case of logistic regression, the \(c\)-index:

\[\dfrac{\sum_{i,j:Y_i=0,Y_j=1} 1_{[\hat Y(x_i) < \hat Y(x_j)]}}{ \sum_{i,j:Y_i=0,Y_j=1} 1}\]

This estimates the probability of correctly ordering the risk of a randomly selected outcome and randomly selected non-outcome.

AUC

ggplot(filter(roc_data, model == "firth", step == "validation")) + 
  geom_path(aes(x = 1 - spec,# 
                y = sens)) + 
  geom_ribbon(aes(x = 1 - spec,
                  ymin = 0,
                  ymax = sens), 
              fill = "#777777") + 
  geom_abline(intercept = 0, slope = 1) + 
  geom_text(data = tibble(x = 0.8, y = 0.2, label = "AUC"),
            aes(x = x, 
                y = y, 
                label = label), 
            size = 10) + 
  scale_x_continuous(name = "Spec", 
                     breaks = seq(0, 1, length = 11), 
                     labels = formatC(seq(1, 0, length = 11), format = "g", digits = 1), 
                     expand = expand_scale(add = 0)) + 
  scale_y_continuous(name = "Sens", 
                     breaks = seq(0, 1, length = 11), 
                     labels = formatC(seq(0, 1, length = 11), format = "g", digits = 1), 
                     expand = expand_scale(add = 0)) +
  theme(text = element_text(size = 16), 
        legend.position = "none");

AOC = 1 - AUC

library(grid);library(jpeg);#
g <- rasterGrob(readJPEG("images/aoc.jpg"), width=unit(1,"npc"), height=unit(1,"npc"), interpolate = FALSE)
ggplot(filter(roc_data, model == "firth", step == "validation")) + 
  annotation_custom(g, -Inf, Inf, -Inf, Inf) + 
  geom_path(aes(x = 1 - spec, 
                y = sens)) + 
  geom_ribbon(aes(x = 1 - spec,
                  ymin = 0,
                  ymax = sens), 
              fill = "#777777") + 
  geom_abline(intercept = 0, slope = 1) + 
  geom_text(aes(x = 0.15, 
                y = .8, 
                label = "AOC"), 
            size = 10, 
            color = "#FFFFFF") +
  geom_text(aes(x = 0.8, 
                y = .2, 
                label = "AUC"), 
            size = 10) +
  scale_x_continuous(name = "Spec", 
                     breaks = seq(0, 1, length = 11), 
                     labels = formatC(seq(1, 0, length = 11), format = "g", digits = 1), 
                     expand = expand_scale(add = 0)) + 
  scale_y_continuous(name = "Sens", 
                     breaks = seq(0, 1, length = 11), 
                     labels = formatC(seq(0, 1, length = 11), format = "g", digits = 1),
                     expand = expand_scale(add = 0)) +
  theme(text = element_text(size = 16), 
        legend.position = "none");

Higher AUC is better

AUCs for ‘(true)’ and ‘firth’ in training step:

Area under the curve: 0.705
Area under the curve: 0.7375

AUCs for ‘(true)’ and ‘firth’ in validation step:

Area under the curve: 0.7235
Area under the curve: 0.6734

What is AUC under the intercept-only (null) model?

Summary thoughts on model assessment

Potential for optimism increases with inherent variability of model building method

Summary thoughts on model assessment

Be specific when saying the model was ‘validated’, and don’t assume others mean what they think they mean:

  • Does this refer to validation (model selection) or testing (model assessment)?
  • How was validation measured?
  • If cross-validation was involved, was it done properly?
  • Was the set of models considered sensible?
  • What population did validation and testing occur in?

Summary thoughts on model assessment

The effective region for improvement (difference between null and true models) is often narrow, especially in binary outcome models

References

Hastie, T., Tibshirani, R. and Friedman, J., 2009. The elements of statistical learning.

Steyerberg, E.W., Vickers, A.J., Cook, N.R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M.J. and Kattan, M.W., 2010. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 21(1), pp.128-138.